IODS Course Project
Hi, I am Aarthi Ravindran, a PhD student at the University of Eastern Finland. I heard about this course from the AIVI coordinators, who circulated an email in UEF. In today's scientific field we have an abundance of open data. If we learn the art of working with open data, we can use it in our research. Click the links below to see my GitHub repository and course diary webpage.
Aarthi Ravindran Github repository
Aarthi Ravindran Course Diary
This week was the introduction class, so it was all about getting familiar with RStudio, GitHub and DataCamp. I completed the following tasks:
1. installation of tools
* R
* RStudio
* GitHub
2. Creating course templates
* GitHub
* RStudio
3. Learning R on DataCamp
Part 1: Hello R
Part 2: Getting Hooked on R
4. RStudio Exercise 1: Tasks and Instructions
* In this exercise, the IODS project was forked from GitHub into my own GitHub account. It was then opened in RStudio to edit index.Rmd, chapter1.Rmd and chapter2.Rmd and fill in the required information. The edited files were saved and knitted, then committed (noting the changes in the commit message each time) and pushed to GitHub from RStudio. Once the push is done, refresh your GitHub project page and look for the changes you have made.
This week we started with data wrangling; with the obtained data, regression and model validation were then carried out.
1. Data Wrangling (Data Subsetting) Here, the data is subset into a simpler dataset for the regression analysis. The original data of 183 rows and 60 columns is subset into 166 rows and 7 columns.
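As a sketch of what that subsetting involves (the item names below are hypothetical; the real script averages the actual questionnaire items into combination scales and drops students with zero exam points):

```r
# Toy illustration of the wrangling step: average related questionnaire
# items into combination variables, then drop empty exam responses.
raw <- data.frame(
  D1 = c(4, 3, 5, 2), D2 = c(3, 3, 4, 1),    # hypothetical "deep" items
  SU1 = c(2, 4, 1, 3), SU2 = c(3, 4, 2, 3),  # hypothetical "surface" items
  Points = c(25, 0, 30, 18)                  # 0 = did not take the exam
)
raw$deep <- rowMeans(raw[, c("D1", "D2")])
raw$surf <- rowMeans(raw[, c("SU1", "SU2")])
learning_subset <- raw[raw$Points > 0, c("deep", "surf", "Points")]
dim(learning_subset)  # 3 rows, 3 columns
```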
This subset data is stored in https://github.com/AarthiRavindran/IODS-project/data/
Let’s read the data and explore its dimension and structure.
# Exercise:1 Data Analysis
# Read the students2014 data into R either from your local folder (if you completed the Data wrangling part) or from this url: http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt . (The separator is a comma "," and the file includes a header). Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. There is information related to the data here. (0-2 points)
#Read the data
setwd("~/IODS-project/")
library(dplyr)
learning2014 <- read.table("~/IODS-project/data/learning2014.txt")
Explore the data using the dimension (dim), structure (str) and head commands:
# Explore the data
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Exploring the data shows columns for gender, age, attitude, deep questions, strategic questions, surface questions and exam points. The data has 166 rows and 7 columns, with 110 females and 56 males, a minimum age of 17 and a maximum age of 55.
Graphical Overview
#Question: 2
#Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
pairs(learning2014[!names(learning2014) %in% c("gender")],col=learning2014$gender)
#install.packages("GGally")
#install.packages("ggplot2")
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014,
mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20))
)
Result comment: We can see that the proportion of females is higher and that the minimum and maximum age vary by gender. Attitude, deep and stra are positively correlated with age, whereas surf and points are negatively correlated with age, and likewise for the other variable pairs. The highest correlation is between points and attitude; deep and surf have the lowest (negative) correlation between them.
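The strongest pair can also be picked out numerically rather than read off the plot. A small sketch, using the built-in mtcars data as a stand-in since the idea is the same for learning2014:

```r
# Find the variable pair with the largest absolute correlation.
num_vars <- mtcars[, c("mpg", "wt", "hp", "qsec")]
cm <- cor(num_vars)
diag(cm) <- 0                                  # ignore self-correlations
idx <- which(abs(cm) == max(abs(cm)), arr.ind = TRUE)[1, ]
c(rownames(cm)[idx[1]], colnames(cm)[idx[2]])  # strongest-correlated pair
```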
2. Linear Regression
#Question: 3
#Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)
#Question:4
#Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R squared of the model. (0-3 points)
Let's take the negative correlation between deep and surf and look at it more closely.
qplot(surf, deep, data= learning2014) + geom_smooth(method = "lm")
Let’s check how the linear model to this data works. The equation for the model is \[
Y_i = \alpha + \beta_1 X_i + \epsilon_i
\] where
Y = surf,
X = deep,
\(\alpha\) = constant, \(\beta_1\) = regression coefficient for deep,
\(\epsilon_i\) = random error term.
Estimation of the model yields the following results:
my_model1 <- lm(surf ~ deep, data = learning2014)
results1 <- summary(my_model1)
knitr::kable(results1$coefficients, digits=3, caption="Regression coefficients")
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 3.924 | 0.262 | 14.958 | 0 |
| deep | -0.309 | 0.071 | -4.383 | 0 |
Both the intercept and deep are statistically significant predictors (p < 0.001). The coefficient of determination \(R^2\) = 0.105, which is not particularly high.
From the graphs above we can see that the highest correlation is between points and attitude. Let's take them for further analysis.
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
Let's fit the linear model for points and attitude, the positively correlated pair with the highest correlation.
The equation for the model is \[
Y_i = \alpha + \beta_1 X_i + \epsilon_i
\] where
Y = points,
X = attitude,
\(\alpha\) = constant,
\(\beta_1\) = regression coefficient for attitude,
\(\epsilon_i\) = random error term.
Estimation of the model yields the following results:
my_model <- lm(points ~ attitude, data = learning2014)
results <- summary(my_model)
knitr::kable(results$coefficients, digits=3, caption="Regression coefficients")
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 11.637 | 1.830 | 6.358 | 0 |
| attitude | 3.525 | 0.567 | 6.214 | 0 |
Both the intercept and attitude are statistically significant predictors (p < 0.001). The coefficient of determination \(R^2\) = 0.191, which is not particularly high. Adding more predictors could probably improve both models studied here.
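The exercise actually asks for three explanatory variables. A hedged sketch of that workflow, shown on the built-in mtcars data since the variable names below are stand-ins rather than the course variables:

```r
# Fit a three-predictor model, inspect p-values, and refit without the
# least significant predictor (backward elimination by hand).
full_model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
summary(full_model)$coefficients          # check Pr(>|t|) for each term
# Suppose qsec turns out not significant; drop it and refit:
reduced_model <- lm(mpg ~ wt + hp, data = mtcars)
summary(reduced_model)$r.squared          # multiple R-squared of the reduced model
```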
Diagnostic Plots
#Question 5
#Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots. (0-3 points)
#Taking highly correlated variables for diagnostic plot
plot(my_model, which=c(1,2,5))
Result: From the diagnostic plots we can see outliers. If we removed the outliers, we might obtain a better R-squared value and the model might fit more efficiently.
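One way to follow up on that observation numerically is Cook's distance, which the Residuals vs Leverage plot is based on. A sketch on a stand-in model, with the common 4/n cutoff as an assumption:

```r
# Flag high-influence points with Cook's distance and refit without them.
model <- lm(mpg ~ wt, data = mtcars)
cd <- cooks.distance(model)
influential <- which(cd > 4 / nrow(mtcars))    # rule-of-thumb cutoff
clean <- if (length(influential)) mtcars[-influential, ] else mtcars
refit <- lm(mpg ~ wt, data = clean)
summary(refit)$r.squared                       # compare with the original fit
```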
The third week's class is about logistic regression.
Logistic Regression:
This is a statistical model that uses the logistic function to model a binary dependent variable. Logistic regression (or logit regression) estimates parameters for a binary outcome (0 and 1), whereas, as we saw in the previous exercise, linear regression estimates parameters for a continuous outcome (from -infinity to +infinity).
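A minimal sketch of the idea with R's glm(), on toy data invented purely for illustration:

```r
# Logistic regression: model a 0/1 outcome through the logistic link.
toy <- data.frame(
  hours  = c(1, 2, 3, 4, 5, 6, 7, 8),
  passed = c(0, 0, 0, 1, 0, 1, 1, 1)
)
fit <- glm(passed ~ hours, data = toy, family = "binomial")
exp(coef(fit)["hours"])   # odds ratio: change in odds per extra hour studied
predict(fit, newdata = data.frame(hours = 5), type = "response")  # fitted probability
```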
Data wrangling (max 5 points)
1. Take the data files student-mat.csv and student-por.csv from the given link.
2. Create a new R script 'create_alc.R' and save it in the data folder.
3. Read both student-mat.csv and student-por.csv files and explore the structure and dimensions of the data. (1 point - done)
4. Join the two data sets using the variables "school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet" as (student) identifiers. Keep only the students present in both data sets. Explore the structure and dimensions of the joined data. (1 point - done)
5. Either a) copy the solution from the DataCamp exercise The if-else structure to combine the 'duplicated' answers in the joined data, or b) write your own solution to achieve this task. (1 point - done)
6. Take the average of the answers related to weekday and weekend alcohol consumption to create a new column 'alc_use' in the joined data. Then use 'alc_use' to create a new logical column 'high_use' which is TRUE for students for whom 'alc_use' is greater than 2 (and FALSE otherwise). (1 point - done)
7. Glimpse at the joined and modified data to make sure everything is in order. The joined data should now have 382 observations of 35 variables. Save the joined and modified data set to the 'data' folder, using for example the write.csv() or write.table() functions. (1 point - done)
Aarthi Ravindran, 13th September 2019
Data Wrangling - 5 points
1. Take the data files student-mat.csv and student-por.csv from the given link.
2. Create a new R script 'create_alc.R' and save it in the data folder.
3. Read both student-mat.csv and student-por.csv files and explore the structure and dimensions of the data.
# Reading the files
student_mat <- read.csv("C:/Users/Aarthi/Documents/IODS-project/data/student-mat.csv", sep = ";", header = TRUE)
student_por <- read.csv("C:/Users/Aarthi/Documents/IODS-project/data/student-por.csv", sep = ";", header = TRUE)
# Explore the files
dim(student_mat)
## [1] 395 33
dim(student_por)
## [1] 649 33
str(student_mat)
## 'data.frame': 395 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
str(student_por)
## 'data.frame': 649 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1 : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2 : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3 : int 11 11 12 14 13 13 13 13 17 13 ...
#Join the data
library(dplyr)
join_by <- c("school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery","internet")
student_mat_por_joined <- inner_join(student_mat, student_por, by=join_by)
dim(student_mat_por_joined)
## [1] 382 53
str(student_mat_por_joined)
## 'data.frame': 382 obs. of 53 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian.x : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime.x: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime.x : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures.x : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup.x : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup.x : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid.x : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities.x: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher.x : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic.x : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel.x : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime.x : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout.x : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc.x : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc.x : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health.x : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences.x : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1.x : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2.x : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3.x : int 6 6 10 15 10 15 11 6 19 15 ...
## $ guardian.y : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime.y: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime.y : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures.y : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsup.y : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup.y : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid.y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ activities.y: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher.y : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic.y : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel.y : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime.y : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout.y : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc.y : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc.y : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health.y : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences.y : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1.y : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2.y : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3.y : int 11 11 12 14 13 13 13 13 17 13 ...
glimpse(student_mat_por_joined)
## Observations: 382
## Variables: 53
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G...
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 1...
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U...
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3,...
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4...
## $ Mjob <fct> at_home, at_home, at_home, health, other, service...
## $ Fjob <fct> teacher, other, other, services, other, other, ot...
## $ reason <fct> course, course, other, home, home, reputation, ho...
## $ guardian.x <fct> mother, father, mother, mother, father, mother, m...
## $ traveltime.x <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1...
## $ studytime.x <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3...
## $ failures.x <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup.x <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no...
## $ famsup.x <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, ye...
## $ paid.x <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes...
## $ activities.x <fct> no, no, no, yes, no, yes, no, no, no, yes, no, ye...
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ higher.x <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,...
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, ye...
## $ romantic.x <fct> no, no, no, yes, no, no, no, no, no, no, no, no, ...
## $ famrel.x <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3...
## $ freetime.x <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2...
## $ goout.x <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3...
## $ Dalc.x <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc.x <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2...
## $ health.x <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2...
## $ absences.x <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4,...
## $ G1.x <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10...
## $ G2.x <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10...
## $ G3.x <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 1...
## $ guardian.y <fct> mother, father, mother, mother, father, mother, m...
## $ traveltime.y <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1...
## $ studytime.y <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3...
## $ failures.y <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup.y <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no...
## $ famsup.y <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, ye...
## $ paid.y <fct> no, no, no, no, no, no, no, no, no, no, no, no, n...
## $ activities.y <fct> no, no, no, yes, no, yes, no, no, no, yes, no, ye...
## $ higher.y <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,...
## $ romantic.y <fct> no, no, no, yes, no, no, no, no, no, no, no, no, ...
## $ famrel.y <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3...
## $ freetime.y <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2...
## $ goout.y <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3...
## $ Dalc.y <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc.y <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2...
## $ health.y <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2...
## $ absences.y <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 1...
## $ G1.y <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12,...
## $ G2.y <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 1...
## $ G3.y <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 1...
# result interpretation: when I explore the joined data, the duplicated columns appear with .x and .y suffixes, so we have to combine the duplicates
#To combine the duplicates
# create a new data frame with only the joined columns
alc <- select(student_mat_por_joined, one_of(join_by))
# columns that were not used for joining the data
notjoined_columns <- colnames(student_mat)[!colnames(student_mat) %in% join_by]
# print out the columns not used for joining
notjoined_columns
## [1] "guardian" "traveltime" "studytime" "failures" "schoolsup"
## [6] "famsup" "paid" "activities" "higher" "romantic"
## [11] "famrel" "freetime" "goout" "Dalc" "Walc"
## [16] "health" "absences" "G1" "G2" "G3"
# for every column name not used for joining...
for(column_name in notjoined_columns) {
# select two columns from 'math_por' with the same original name
two_columns <- select(student_mat_por_joined, starts_with(column_name))
# select the first column vector of those two columns
first_column <- select(two_columns, 1)[[1]]
# if that first column vector is numeric...
if(is.numeric(first_column)) {
# take a rounded average of each row of the two columns and
# add the resulting vector to the alc data frame
alc[column_name] <- round(rowMeans(two_columns))
} else { # else if it's not numeric...
# add the first column vector to the alc data frame
alc[column_name] <- first_column
}
}
# glimpse at the new combined data
glimpse(alc)
## Observations: 382
## Variables: 33
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob <fct> teacher, other, other, services, other, other, othe...
## $ reason <fct> course, course, other, home, home, reputation, home...
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
#Take average of weekend and weekday alcohol consumption by creating new column in the data
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
#Create a new logical column 'high_use' which is TRUE for students whose alc_use is greater than 2
alc <- mutate(alc, high_use = alc_use > 2)
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob <fct> teacher, other, other, services, other, other, othe...
## $ reason <fct> course, course, other, home, home, reputation, home...
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU...
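Note that high_use comes out TRUE for every student above. That is what happens when the column name is quoted inside the comparison: "alc_use" > 2 compares the literal string "alc_use" with 2 (coerced to "2"), not the column values. A small demonstration on invented data:

```r
# Quoted name: a single string-vs-number comparison, recycled to all rows.
d <- data.frame(alc_use = c(1.0, 2.5, 1.5, 4.0))
wrong <- "alc_use" > 2   # compares the string "alc_use" with "2" lexically
right <- d$alc_use > 2   # element-wise test of the column values
right                    # FALSE TRUE FALSE TRUE
```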
write.csv(alc, "C:/Users/Aarthi/Documents/IODS-project/data/new_data.csv")
write.table(alc, "C:/Users/Aarthi/Documents/IODS-project/data/new_data.txt")
Analysis (max 15 points)
1. Create 'chapter3.Rmd' and include it in 'index.Rmd' to perform the analysis in the chapter3.Rmd file. (Done)
2. Read the joined data, print the variables in the data and describe it. (0-1 point) (Done)
3. Select variables for studying the relationship between high/low alcohol consumption and other variables in the data. (0-1 point) (Done)
4. Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption. (0-5 points)
5. Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable. (0-5 points) (Done)
6. Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model. (0-3 points) (Done)
Start of analysis. About the data:
This data concerns student alcohol consumption. Students who appear in both datasets were taken and combined. The joined dataset has 382 students and describes their school, sex, age, address, family size, parents' status, mother's and father's education, mother's and father's job, activities, weekday and weekend alcohol consumption, health, average alcohol use, and a column indicating whether their consumption is high.
alc <- read.table("C:/Users/Aarthi/Documents/IODS-project/data/new_data.txt", sep = " ")
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
From my point of view, the variables that might be associated with alcohol consumption are sex, health, absences, failures and activities.
#install.packages("ggplot2")
#install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.3 v purrr 0.3.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
gather(alc) %>% glimpse
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Observations: 13,370
## Variables: 2
## $ key <chr> "school", "school", "school", "school", "school", "schoo...
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 2 x 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <fct> <lgl> <int> <dbl>
## 1 F TRUE 198 11.5
## 2 M TRUE 184 11.5
alc %>% group_by(activities, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 2 x 4
## # Groups: activities [2]
## activities high_use count mean_grade
## <fct> <lgl> <int> <dbl>
## 1 no TRUE 181 11.2
## 2 yes TRUE 201 11.7
alc %>% group_by(health, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 5 x 4
## # Groups: health [5]
## health high_use count mean_grade
## <int> <lgl> <int> <dbl>
## 1 1 TRUE 46 12.6
## 2 2 TRUE 44 11.5
## 3 3 TRUE 81 11.3
## 4 4 TRUE 67 11.4
## 5 5 TRUE 144 11.2
alc %>% group_by(absences, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 26 x 4
## # Groups: absences [26]
## absences high_use count mean_grade
## <int> <lgl> <int> <dbl>
## 1 0 TRUE 65 12.0
## 2 1 TRUE 51 11.6
## 3 2 TRUE 58 11.4
## 4 3 TRUE 41 11.7
## 5 4 TRUE 36 11.5
## 6 5 TRUE 22 11.3
## 7 6 TRUE 21 12.1
## 8 7 TRUE 12 11
## 9 8 TRUE 20 11.2
## 10 9 TRUE 12 10.5
## # ... with 16 more rows
alc %>% group_by(failures, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups: failures [4]
## failures high_use count mean_grade
## <int> <lgl> <int> <dbl>
## 1 0 TRUE 334 11.9
## 2 1 TRUE 24 8.67
## 3 2 TRUE 19 7.53
## 4 3 TRUE 5 7.4
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("grade")
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")
Next, fit a logistic regression model to study the relationship between the chosen variables and high/low alcohol consumption.
m <- glm(high_use ~ failures + absences, data = alc, family = "binomial")
## Warning: glm.fit: algorithm did not converge
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## 2.409e-06 2.409e-06 2.409e-06 2.409e-06 2.409e-06
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.657e+01 2.422e+04 0.001 0.999
## failures -3.110e-08 3.136e+04 0.000 1.000
## absences 1.430e-09 3.347e+03 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 0.0000e+00 on 381 degrees of freedom
## Residual deviance: 2.2162e-09 on 379 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25
coef(m)
## (Intercept) failures absences
## 2.656607e+01 -3.109885e-08 1.429756e-09
#Fit a logistic regression model with selected variables (here sex and absences)
library(dplyr)
m <- glm(high_use ~ sex + absences, data = alc, family = "binomial")
## Warning: glm.fit: algorithm did not converge
is.na(m)
## coefficients residuals fitted.values effects
## FALSE FALSE FALSE FALSE
## R rank qr family
## FALSE FALSE FALSE FALSE
## linear.predictors deviance aic null.deviance
## FALSE FALSE FALSE FALSE
## iter weights prior.weights df.residual
## FALSE FALSE FALSE FALSE
## df.null y converged boundary
## FALSE FALSE FALSE FALSE
## model call formula terms
## FALSE FALSE FALSE FALSE
## data offset control method
## FALSE FALSE FALSE FALSE
## contrasts xlevels
## FALSE FALSE
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
#CI <- confint(m) %>% exp
# print out the odds ratios with their confidence intervals
#cbind(OR, CI)
# fit the model
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
## Warning: glm.fit: algorithm did not converge
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
## failures absences sex high_use probability prediction
## 373 1 0 M TRUE 1 TRUE
## 374 1 7 M TRUE 1 TRUE
## 375 0 1 F TRUE 1 TRUE
## 376 0 6 F TRUE 1 TRUE
## 377 1 2 F TRUE 1 TRUE
## 378 0 2 F TRUE 1 TRUE
## 379 2 2 F TRUE 1 TRUE
## 380 0 3 F TRUE 1 TRUE
## 381 0 4 M TRUE 1 TRUE
## 382 0 2 M TRUE 1 TRUE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use TRUE
## TRUE 382
From this model, sex, absences and failures appear to be related to high alcohol consumption. Note, however, that glm.fit warned that the algorithm did not converge, the null deviance is zero, and every observation in this data has high_use = TRUE: the outcome has no variation, so the model predicts TRUE for everyone. These results should be treated with caution and the data wrangling step revisited.
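Once the data issue is fixed, the predictive power asked for in task 6 can be quantified as the share of incorrect predictions (the training error). A minimal sketch, using the course's loss-function convention; the toy vectors below are hypothetical stand-ins for alc$high_use and alc$probability:

```r
# Proportion of incorrectly classified observations.
# 'class' is the true 0/1 (or logical) outcome, 'prob' the predicted probability.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # TRUE when the 0.5-threshold prediction is wrong
  mean(n_wrong)
}
# toy example: 8 observations, 2 of them misclassified
high_use    <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
probability <- c(0.9, 0.8, 0.2, 0.6, 0.4, 0.1, 0.7, 0.3)
loss_func(class = high_use, prob = probability)  # 2 of 8 wrong -> 0.25
```

A lower training error means better predictive power on the data the model was fitted to; comparing it with the error of a trivial guessing strategy gives a useful baseline.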
Aarthi Ravindran, 20/11/2019 - RStudio Exercise 4: Tasks and Instructions. Note: for the Exercise 5 data wrangling, please check the link below [https://github.com/AarthiRavindran/IODS-project/blob/master/create_human.R]
Analysis Exercises - 15 marks
#load Boston data from the MASS package (Exercise 4.2)
#The following R packages are installed already, which are mandatory to run this analysis
#install.packages("MASS")
#install.packages("corrr")
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyr)
library(corrr)
library(dplyr)
#install.packages("corrplot")
library(corrplot)
## corrplot 0.84 loaded
#if(!require(devtools)) install.packages("devtools")
#devtools::install_github("ggobi/ggally")
#install.packages("GGally")
library(GGally)
data(Boston)
#Explore the Boston data (Exercise 4.2)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
#Graphical Overview of data (Exercise 4.3)
pairs(Boston)
#Calculate the correlation of variables in the matrix
cor_matrix <- cor(Boston)%>%round(digits = 2)
#print correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
#visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
#Another package for correlation matrix
# calculate the correlation matrix and round it
cor_matrix_gg <- ggcorr(Boston, palette = "RdBu", label = TRUE)
cor_matrix_gg
*Description of Boston Data:*
The Boston dataset describes housing values in the suburbs of Boston. It has 506 rows and 14 columns (from dim()); the columns are explained below:
crim - per capita crime rate by town, zn - proportion of residential land zoned for lots over 25,000 sq.ft., indus - proportion of non-retail business acres per town, chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise), nox - nitrogen oxides concentration (parts per 10 million), rm - average number of rooms per dwelling, age - proportion of owner-occupied units built prior to 1940, dis - weighted mean of distances to five Boston employment centres, rad - index of accessibility to radial highways, tax - full-value property-tax rate per $10,000, ptratio - pupil-teacher ratio by town, black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town, lstat - lower status of the population (percent), medv - median value of owner-occupied homes in $1000s.
Boston Data Variables Interpretation from Data Exploration and Graphical Plots:
From the str() output we can see that all the variables in the data are numeric or integer.
From summary() we can see the minimum, maximum, median, and 1st and 3rd quartiles of each variable. For example, the crime rate ranges from a minimum of 0.006 to a maximum of 88.98, the proportion of non-retail business acres reaches a maximum of 27.74, dwellings have on average between about 3.6 and 8.8 rooms, and the proportion of owner-occupied units built before 1940 ranges from 2.9 to 100, and so on.
The relationship between two variables can be described by the correlation between them. When two sets of values are strongly linked together, they have a high correlation. The word correlation is made of co- (meaning "together") and relation. Correlation is positive when the values increase together, and negative when one value decreases as the other increases.
The pairs(), corrplot() and ggcorr() plots of the correlation matrix show that several variables are correlated, either positively or negatively. For example, from the correlation plots we can clearly see that:
crim is positively correlated with rad and tax,
zn is negatively correlated with indus, nox and age, and positively correlated with dis,
nox is positively correlated with age, rad, tax and lstat, and negatively correlated with dis and medv, and so on.
The corrplot is easier to read than the pairs plot, and ggcorr additionally annotates the correlation values.
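The sign convention can be illustrated with a toy example (hypothetical vectors, not the Boston data): cor() returns +1 for a perfectly increasing relationship and -1 for a perfectly decreasing one.

```r
# Correlation sign on toy vectors
x <- 1:10
y_pos <- 2 * x + 1    # increases together with x
y_neg <- 20 - 3 * x   # decreases as x increases
cor(x, y_pos)  # 1
cor(x, y_neg)  # -1
```

Real data fall between these extremes, which is what the shades in the correlation plots encode.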
########################################################################################################################
#standardize the dataset
boston_scaled <- scale(Boston) # centres each variable to mean 0 and standardizes it to unit variance
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
#Create the catagorical variable of crime rate from scaled data
#create the quantile vector
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
#create the categorical variable of crime
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# boston_scaled is available
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
#Check the structure of the new test data
str(boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
dim(boston_scaled)
## [1] 506 14
Scaling results: Scaling is important for data analysis: each variable is centred to mean 0 and standardized to unit variance, so that variables measured on large scales do not dominate the model and bias the analysis. Note that scaling does not remove outliers; it only changes the units.
As a result of scaling, the maximum, minimum, quartiles, mean and median of the variables change. For example, the maximum crime rate drops from 88.98 to 9.92 (now in standard-deviation units).
Since all the variables in this data are numeric or integer, the crime rate is next converted into a categorical variable, and the data are split into training and test sets to check how well the model works. The training set contains a random 80% of the rows.
After converting the crime rate into a categorical variable, the data fall into four roughly equal groups: 127 rows with low, 126 with med_low, 126 with med_high and 127 with high crime rate.
#############################################################################################################################
Discussion: Before going into Exercise 4.5, we have to check whether the data contains a categorical variable. From str() we can clearly see that the variable "crime" is a factor with four levels (low, med_low, med_high, high), which is required as the target for Linear Discriminant Analysis.
Linear Discriminant Analysis:
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique. As the name implies dimensionality reduction techniques reduce the number of dimensions (i.e. variables) in a dataset while retaining as much information as possible. Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each case must have a score on one or more quantitative predictor measures, and a score on a group measure.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train) #this train data set has 80% of data
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2376238 0.2400990 0.2722772
##
## Group means:
## zn indus chas nox rm
## low 0.84407367 -0.9368144 -0.07742312 -0.8564341 0.49485321
## med_low -0.03167061 -0.3227949 0.01475116 -0.6175537 -0.07216918
## med_high -0.38910901 0.1835926 0.13355755 0.4034275 0.22524717
## high -0.48724019 1.0169558 -0.02178633 1.0731962 -0.45176268
## age dis rad tax ptratio
## low -0.8632553 0.8022960 -0.6771295 -0.7696370 -0.39654757
## med_low -0.3915532 0.4342033 -0.5488034 -0.5081575 -0.08435174
## med_high 0.4402715 -0.4263911 -0.3934299 -0.2819290 -0.33850883
## high 0.8088365 -0.8572832 1.6397657 1.5152267 0.78268316
## black lstat medv
## low 0.38069798 -0.7789138 0.59864578
## med_low 0.32158081 -0.1798782 0.04932553
## med_high 0.08165893 -0.0098272 0.26231676
## high -0.76439264 0.9293921 -0.73255143
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09146791 0.60005090 -0.93933608
## indus -0.01435722 -0.29491248 0.62988276
## chas -0.07226597 -0.01231809 0.13582316
## nox 0.41308040 -0.74678337 -1.28453698
## rm -0.09831524 -0.14674824 -0.10903666
## age 0.23512478 -0.37259501 0.06543733
## dis -0.12012386 -0.20585777 0.50822451
## rad 3.08396428 1.04609112 0.02502282
## tax 0.13235853 -0.12729307 0.27582546
## ptratio 0.12253617 0.07678104 -0.32087692
## black -0.11939595 0.02966310 0.10484935
## lstat 0.21108191 -0.18489761 0.35814796
## medv 0.18903487 -0.37499734 -0.17878752
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9537 0.0350 0.0113
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
LDA discussion: The biplot shows the observations separated into clusters by crime category. Some med_high observations lie close to the high cluster, suggesting they might fit the high category better than the quantile-based cut placed them. The arrows show how the variables contribute to the discriminants; for example, rad is strongly associated with the high crime-rate group.
#############################################################################################################################
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 20 6 0 0
## med_low 4 13 13 0
## med_high 1 14 13 1
## high 0 0 0 17
#compare the total number of new corrected one to old corrected one
#old uncorrected count
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
#newly corrected count
table(correct_classes)
## correct_classes
## low med_low med_high high
## 26 30 29 17
#number of observations in each class in the training data
lda.fit$counts
## low med_low med_high high
## 101 96 97 110
Discussion of the predicted model: The quantile-based cut assigns classes purely by numeric range, whereas LDA predicts for each test observation the class that best fits it given the other variables. In the cross-tabulation, the diagonal counts the correctly predicted observations; most errors occur between neighbouring classes (e.g. med_low versus med_high), while the high class is predicted almost perfectly.
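The overall accuracy can be read off the cross-tabulation as the sum of the diagonal divided by the total. A sketch with hypothetical counts shaped like the table above:

```r
# Overall accuracy from a confusion matrix (hypothetical counts)
conf <- matrix(c(20,  6,  0,  0,
                  4, 13, 13,  0,
                  1, 14, 13,  1,
                  0,  0,  0, 17),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(conf)) / sum(conf)  # diagonal = correct predictions
accuracy  # 63/102, about 0.62
```

The same two lines applied to table(correct_classes, lda.pred$class) give the accuracy of the fitted LDA model on the actual test set.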
#################################################################################################################################
#1. load the Boston data
library(MASS)
data('Boston')
#2. Scale the Boston Data and standardize the dataset
scaled_Boston <- scale(Boston)
#3. calculate the distances between the observations
# euclidean distance matrix
dist_eu <- dist(scaled_Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix (note: computed here on the unscaled Boston data)
dist_man <- dist(Boston, method = 'manhattan')
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.016 149.145 279.505 342.899 509.707 1198.265
#for comparison, without scaling the distance summaries are:
# euclidean:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.119 85.624 170.539 226.315 371.950 626.047
# manhattan:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 2.016 149.145 279.505 342.899 509.707 1198.265
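Why scaling matters for distances can be seen in a toy sketch (hypothetical matrix, not the Boston data): a variable on a much larger scale dominates the Euclidean distance until both are standardized.

```r
# Effect of scaling on distances
m <- cbind(small = c(0, 1, 2), large = c(0, 1000, 2000))
summary(dist(m))         # dominated by the 'large' column
summary(dist(scale(m)))  # after scaling, both columns contribute equally
```

This is why the k-means clustering below is run on the scaled data.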
#4. Run K-means cluster
# k-means clustering
km <-kmeans(scaled_Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(scaled_Boston, col = km$cluster)
Discussion:
K-means clusters the observations based on the distances between them. The pairs plot shows the distribution of the clusters across each pair of variables: here I used four centres, shown in four different colours. This differs from the LD analysis, which was supervised by the categorical crime variable, whereas k-means is unsupervised clustering based on distance alone.
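The choice of four centres above is arbitrary; a common way to pick k is to plot the total within-cluster sum of squares (WCSS) against k and look for the "elbow". A sketch, with k_max = 10 as an assumed upper limit:

```r
# Choosing k via total within-cluster sum of squares (WCSS)
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
set.seed(123)  # k-means uses random starting centroids, so fix the seed
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
# a sharp drop followed by a flattening of the curve suggests a good k
plot(1:k_max, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
```

The WCSS always decreases as k grows, so the point where the decrease levels off, rather than the minimum, is the interesting one.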
Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)